Abstract

Face alignment aims to estimate the locations of a set of landmarks for a given image. This problem has received much attention, as evidenced by recent advances in both methodology and performance. However, most existing works neither explicitly handle face images with arbitrary poses, nor perform large-scale experiments on non-frontal and profile face images. To address these limitations, this paper proposes a novel face alignment algorithm that estimates both 2D and 3D landmarks and their 2D visibilities for a face image with an arbitrary pose. By integrating a 3D deformable model, a cascaded coupled-regressor approach is designed to estimate both the camera projection matrix and the 3D landmarks. Furthermore, the 3D model also allows us to automatically estimate the 2D landmark visibilities via surface normals. We gather a substantially larger collection of all-pose face images to evaluate our algorithm and demonstrate superior performance over the state-of-the-art methods.

1. Introduction 

This paper aims to advance face alignment for face images with arbitrary poses. Face alignment is the process of applying a supervised learned model to a face image and estimating the locations of a set of facial landmarks, such as eye corners, mouth corners, etc. [8]. Face alignment is a key module in the pipeline of most facial analysis algorithms, normally after face detection and before subsequent feature extraction and classification. Therefore, it is an enabling capability with a multitude of applications, such as face recognition [28], expression recognition [2], etc.

Given the importance of this problem, face alignment has been studied extensively since Cootes' Active Shape Model (ASM) in the early 1990s [8]. Especially in recent years, face alignment has become one of the most published subjects in top vision conferences [1, 18, 26, 32, 33, 35, 40]. The existing approaches can be categorized into three types: the Constrained Local Model (CLM)-based approach (e.g., [8, 23]), the Active Appearance Model (AAM)-based approach (e.g., [16, 19]), and the regression-based approach (e.g., [5, 27]). An excellent survey of face alignment can be found in [30].

Figure 1: Given a face image with an arbitrary pose, our proposed algorithm automatically estimates the 2D locations and visibilities of facial landmarks, as well as 3D landmarks. The displayed 3D landmarks are estimated for the image in the center. Green/red points indicate visible/invisible landmarks.

Despite the continuous improvement in alignment accuracy, face alignment is still a very challenging problem, due to non-frontal face pose, low image quality, occlusion, etc. Among all the challenges, we identify pose-invariant face alignment as the one deserving substantial research effort, for a number of reasons. First, face detection has substantially advanced its capability in detecting faces in all poses, including profiles [39], which calls for the subsequent face alignment to handle faces with arbitrary poses. Second, many facial analysis tasks would benefit from the robust alignment of faces at all poses, such as expression recognition and 3D face reconstruction [21]. Third, there are very few existing approaches that can align a face with any view angle, or that have conducted extensive evaluations on face images across ±90° yaw angles [37, 44], which is a clear contrast with the vast face alignment literature [30].

Motivated by the need to address pose variation, and the lack of prior work in handling poses, as shown in Fig. 1, this paper proposes a novel regression-based approach for pose-invariant face alignment, which aims to estimate the 2D and 3D locations of face landmarks, as well as their visibilities in the 2D image, for a face image with arbitrary pose (e.g., ±90° yaw). By extending the popular cascaded regressor for 2D landmark estimation, we learn two regressors for each cascade layer, one for predicting the update for the camera projection matrix, and the other for predicting the update for the 3D shape parameter. The learning of the two regressors is conducted alternately, with the goal of minimizing the difference between the ground truth updates and the predicted updates. By assuming the 3D surface normals of the 3D landmarks, we can automatically estimate the visibilities of their 2D projected landmarks by inspecting whether the transformed surface normal has a positive z value, and these visibilities are dynamically incorporated into the regressor learning such that only the local appearance of visible landmarks contributes to the learning. Finally, extensive experiments are conducted on a large subset of the AFLW dataset [15] with a wide range of poses, and on the AFW dataset [44], in comparison with a number of state-of-the-art methods. We demonstrate superior 2D alignment accuracy and quantitatively evaluate the 3D alignment accuracy.

Table 1: The comparison of face alignment algorithms in pose handling (estimation error may have different definitions).

| Method | 3D landmark | Visibility | Pose-related database | Pose range | Training face # | Testing face # | Landmark # | Estimation error |
|---|---|---|---|---|---|---|---|---|
| RCPR [4] | No | Yes | COFW | frontal w. occlu. | 1,345 | 507 | 19 | 8.5 |
| CoR [38] | No | Yes | COFW; LFPW-O; Helen-O | frontal w. occlu. | 1,345; 468; 402 | 507; 112; 290 | 19; 49; 49 | 8.5 |
| TSPM [44] | No | No | AFW | all poses | 2,118 | 468 | 6 | 11.1 |
| CDM [37] | No | No | AFW | all poses | 1,300 | 468 | 6 | 9.1 |
| OSRD [32] | No | No | MVFW | < ±40° | 2,050 | 450 | 68 | N/A |
| TCDCN [42] | No | No | AFLW; AFW | < ±60° | 10,000 | 3,000; ~313 | 5 | 8.0; 8.2 |
| PIFA | Yes | Yes | AFLW; AFW | all poses | 3,901 | 1,299; 468 | 21; 6 | 6.5; 8.6 |

In summary, the main contributions of this work are: 

• To the best of our knowledge, this is the first face alignment method that can estimate 2D/3D landmarks and their visibilities for a face image with an arbitrary pose.

• By integrating a 3D deformable model, a cascaded coupled-regressor approach is developed to estimate both the camera projection matrix and the 3D landmarks, where one benefit of the 3D model is that landmark visibilities are computed automatically via surface normals.

• A substantially larger number of non-frontal view face images are utilized in evaluation, with demonstrated superior performance over the state of the art.

2. Prior Work 

We now review the prior work in generic face alignment, 
pose-invariant face alignment, and 3D face alignment. 


The first type of face alignment approach is based on the Constrained Local Model (CLM), where the first example is the ASM [8]. The basic idea is to learn a set of local appearance models, one for each landmark, and the decisions from the local models are combined with a global shape model. There are generative or discriminative [9] approaches to learning the local model, and various approaches to utilizing the shape constraint [1]. While the local models are favored for higher estimation precision, they also create difficulty for alignment on low-resolution images due to limited local appearance. In contrast, the AAM method [7, 19] and its extensions [17, 22] learn a global appearance model, whose similarity to the input image drives the estimation of the landmarks. While the AAM is known to have difficulty with unseen subjects [11], recent developments have substantially improved its generalization capability [25]. Motivated by the Shape Regression Machine [43] in the medical domain, cascaded regressor-based methods have been very popular in recent years [5, 27]. On one hand, the series of regressors progressively reduces the alignment error and leads to high accuracy. On the other hand, advanced feature learning also renders ultra-efficient alignment procedures [14, 20]. Other than the three major types of algorithms, there are also recent works based on graphical models [44] and deep learning [42].

Despite the explosion of methodology and efforts on face alignment, the literature on pose-invariant face alignment is rather limited, as shown in Tab. 1. There are four approaches explicitly handling faces with a wide range of poses. Zhu and Ramanan proposed the TSPM approach for simultaneous face detection, pose estimation and face alignment [44]. An AFW dataset of in-the-wild faces with all poses is labeled with 6 landmarks and used for experiments. The cascaded deformable shape model (CDM) is a regression-based approach and probably the first approach claiming to be "pose-free" [37]; therefore it is the most relevant work to ours. However, most of its experimental datasets contain near-frontal view images, except the AFW dataset, on which it improves over [44]. Also, there is no visibility estimation of the 2D landmarks. Zhang et al. develop an effective deep learning based method to estimate 5 landmarks [42]. While accurate results are obtained, all testing images appear to be within ~±60° so that all 5 landmarks are visible and there is no visibility estimation. The OSRD approach has a similar experimental constraint in that all images are within ±40° [32]. Other than these four works, the work on occlusion-invariant face alignment is also relevant, since non-frontal faces can be considered as one kind of occlusion; examples are RCPR [4] and CoR [38]. Despite being able to estimate visibilities, neither method has been evaluated on faces with large pose variations. Finally, none of the aforementioned methods in this paragraph explicitly estimates the 3D locations of landmarks.

Figure 2: Overall architecture of our proposed PIFA method, with three main modules (3D modeling, cascaded coupled-regressor learning, and 3D surface-enabled visibility estimation). Green/red arrows indicate surface normals pointing toward/away from the camera.

3D face alignment aims to recover the 3D locations of facial landmarks given a 2D image [12, 29]. There is also a very recent paper on 3D face alignment from videos [13]. However, almost all methods take near-frontal-view face images as input, while our method can make use of all-pose images. A relevant but different problem is 3D face reconstruction, which recovers the detailed 3D surface model from one image, multiple images, or an image collection [10, 24]. Finally, 3D face models have been used in assisting 2D face alignment [31]. However, they have not been explicitly integrated into the powerful cascaded regressor framework, which is one of the main technical novelties of our approach.

3. Pose-Invariant 3D Face Alignment 

This section presents the details of our proposed Pose-Invariant 3D Face Alignment (PIFA) algorithm, with emphasis on the training procedure. As shown in Fig. 2, we first learn a 3D Morphable Model (3DMM) from a set of labeled 3D scans, where a set of 2D landmarks on an image can be considered as a projection of a 3DMM instance (i.e., 3D landmarks). For each 2D training face image, we assume that there exist manually labeled 2D landmarks and their visibilities, as well as the corresponding 3D ground truth: the 3D landmarks and the camera projection matrix. Given the training images and 2D/3D ground truth, we train a cascaded coupled-regressor that is composed of two regressors at each cascade layer, for the estimation of the updates of the 3DMM coefficients and the projection matrix respectively. Finally, the visibilities of the projected 3D landmarks are automatically computed via the domain knowledge of the 3D surface normals, and incorporated into the regressor learning procedure.

3.1. 3D Face Modeling 

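Face alignment concerns the 2D face shape, represented by the locations of $N$ 2D landmarks, i.e.,

$$U = \begin{pmatrix} u_1 & u_2 & \cdots & u_N \\ v_1 & v_2 & \cdots & v_N \end{pmatrix}. \tag{1}$$

A 2D face shape $U$ is a projection of a 3D face shape $S$, similarly represented by the homogeneous coordinates of $N$ 3D landmarks, i.e.,

$$S = \begin{pmatrix} x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N \\ z_1 & z_2 & \cdots & z_N \\ 1 & 1 & \cdots & 1 \end{pmatrix}. \tag{2}$$

Similar to the prior work [31], a weak perspective model is assumed for the projection,

$$U = MS, \tag{3}$$

where $M$ is a $2 \times 4$ projection matrix with six degrees of freedom (yaw, pitch, roll, scale and 2D translation).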

Following the basic idea of 3DMM [3], we assume a 3D face shape is an instance of the 3DMM,

$$S = S_0 + \sum_{i=1}^{N_S} p_i S_i, \tag{4}$$

where $S_0$ and $S_i$ are the mean shape and the $i$th shape basis of the 3DMM respectively, $N_S$ is the total number of shape bases, and $p_i$ is the $i$th shape coefficient. Given a dataset of 3D scans with manual labels on $N$ 3D landmarks per scan, we first perform Procrustes analysis on the 3D scans to remove the global transformation, and then conduct Principal Component Analysis (PCA) to obtain $S_0$ and $\{S_i\}$ [3] (see the top-left part of Fig. 2).

The collection of all shape coefficients $\mathbf{p} = (p_1, p_2, \cdots, p_{N_S})$ is termed the 3D shape parameter of an image. At this point, face alignment for a testing image $I$ has been converted from the estimation of $U$ to the estimation of $P = \{M, \mathbf{p}\}$. The conversion is motivated by a few factors. First, without the 3D modeling, it is very difficult to model the out-of-plane rotation, which has a varying number of landmarks depending on the rotation angle. Second, as pointed out by [31], by using only a fraction of the number of shape bases, a 3DMM can have representation power equivalent to its 2D counterpart. Hence, using a 3D model might lead to a more compact representation of the unknown parameters.

Ground truth $P$. Estimating $P$ for a testing image implies the existence of ground truth $P$ for each training image. However, while $U$ can be manually labeled on a face image, $P$ is normally unavailable unless a 3D scan is captured along with a face image. Therefore, in order to leverage the vast amount of existing 2D face alignment datasets, such as the AFLW dataset [15], it is desirable to estimate $P$ for a face image and use it as the ground truth for learning.

Given a face image $I$, we denote the manually labeled 2D landmarks as $U$ and the landmark visibility as $\mathbf{v}$, an $N$-dim vector with binary elements indicating visible (1) or invisible (0) landmarks. Note that for invisible landmarks, it is not necessary to label their 2D locations. We define the following objective function to estimate $M$ and $\mathbf{p}$,

$$J(M, \mathbf{p}) = \left\| \left( M \left( S_0 + \sum_{i=1}^{N_S} p_i S_i \right) - U \right) \odot V \right\|^2, \tag{5}$$

where $V = (\mathbf{v}^\top; \mathbf{v}^\top)$ is a $2 \times N$ visibility matrix, $\odot$ denotes element-wise multiplication, and $\|\cdot\|^2$ is the sum of the squares of all matrix elements. Basically, $J(\cdot, \cdot)$ computes the difference between the visible 2D landmarks and their 3D projections. An alternating estimation scheme is utilized, i.e., by assuming $\mathbf{p}^0 = \mathbf{0}$, we estimate $M^k = \arg\min_M J(M, \mathbf{p}^{k-1})$, and then $\mathbf{p}^k = \arg\min_{\mathbf{p}} J(M^k, \mathbf{p})$, iteratively until the changes in $M$ and $\mathbf{p}$ are small enough. Both minimizations can be efficiently solved in closed form via least squares.
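
The alternating scheme for Eq. 5 can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the array layout (`bases` stacked as an `(N_S, 4, N)` array with a zero homogeneous row), and the unconstrained least-squares solve for the 2 × 4 matrix are all assumptions made for the example.

```python
import numpy as np

def fit_projection_and_shape(U, v, S0, bases, n_iters=50, tol=1e-10):
    """Alternately minimize J(M, p) of Eq. 5 over M and p.

    U     : (2, N) labeled 2D landmarks
    v     : (N,) binary visibility labels
    S0    : (4, N) mean shape in homogeneous coordinates (last row = 1)
    bases : (N_S, 4, N) shape bases; the last row is assumed to be 0 so
            that S0 + sum_i p_i S_i stays homogeneous
    """
    vis = np.asarray(v, dtype=bool)
    p = np.zeros(bases.shape[0])              # p^0 = 0
    M = np.zeros((2, 4))
    for _ in range(n_iters):
        # M-step: least squares for M with p fixed, visible columns only.
        S = S0 + np.tensordot(p, bases, axes=1)
        M_new = np.linalg.lstsq(S[:, vis].T, U[:, vis].T, rcond=None)[0].T
        # p-step: sum_i p_i (M S_i) = U - M S0 on the visible columns.
        A = np.stack([(M_new @ B)[:, vis].ravel() for B in bases], axis=1)
        b = (U - M_new @ S0)[:, vis].ravel()
        p_new = np.linalg.lstsq(A, b, rcond=None)[0]
        change = max(np.abs(M_new - M).max(), np.abs(p_new - p).max())
        M, p = M_new, p_new
        if change < tol:
            break
    return M, p

def objective(M, p, U, v, S0, bases):
    """J(M, p) from Eq. 5: squared error over visible landmarks."""
    S = S0 + np.tensordot(p, bases, axes=1)
    return float((((M @ S - U) * v) ** 2).sum())
```

Because each step is an exact minimizer with the other variable fixed, the objective cannot increase from one iteration to the next, which is the property the stopping rule relies on.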

3.2. Cascaded Coupled-Regressor 

For each training image $I_i$, we now have its ground truth $P_i = \{M_i, \mathbf{p}_i\}$, as well as its initialization, i.e., $M_i^0 = g(\bar{M}, \mathbf{b}_i)$ and $\mathbf{p}_i^0 = \mathbf{0}$. Here $\bar{M}$ is the average of the ground truth projection matrices in the training set, $\mathbf{b}_i$ is a 4-dim vector indicating the bounding box location, and $g(M, \mathbf{b})$ is a function that modifies the scale and translation of $M$ based on $\mathbf{b}$. Given a dataset of $N_d$ training images, the question is how to formulate an optimization problem to estimate $P_i$. We decide to extend the successful cascaded regressor framework due to its accuracy and efficiency [5]. The general idea of cascaded regressors is to learn a series of regressors, where the $k$th regressor estimates the difference between the current parameter $P_i^{k-1}$ and the ground truth $P_i$, such that the estimated parameter gradually approximates the ground truth.

Motivated by this general idea, we adopt a cascaded coupled-regressor scheme where two regressors are learned at the $k$th cascade layer, for the estimation of $M_i$ and $\mathbf{p}_i$ respectively. Specifically, the first learning task of the $k$th regressor is

$$\Theta_1^k = \arg\min_{\Theta_1^k} \sum_{i=1}^{N_d} \left\| \Delta M_i^k - R_1^k(I_i, U_i, \mathbf{v}_i^{k-1}; \Theta_1^k) \right\|^2, \tag{6}$$

where

$$U_i = M_i^{k-1} \left( S_0 + \sum_{j=1}^{N_S} p_{i,j}^{k-1} S_j \right) \tag{7}$$

is the current estimate of the 2D landmarks ($p_{i,j}^{k-1}$ being the $j$th element of $\mathbf{p}_i^{k-1}$), $\Delta M_i^k = M_i - M_i^{k-1}$, and $R_1^k(\cdot; \Theta_1^k)$ is the desired regressor with parameter $\Theta_1^k$. After $\Theta_1^k$ is estimated, we obtain $\Delta \hat{M}_i = R_1^k(\cdot; \Theta_1^k)$ for all training images and update $M_i^k = M_i^{k-1} + \Delta \hat{M}_i$. Note that this linear update may break the constraint of the projection matrix. Therefore, we estimate the scale and the yaw, pitch, and roll angles $(s, \alpha, \beta, \gamma)$ from $M_i^k$ and compose a new $M_i^k$ based on these four parameters.
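
One way to restore a valid weak-perspective matrix after the additive update is sketched below. Note that this is a swapped-in technique, not the procedure described above: instead of explicitly extracting $(s, \alpha, \beta, \gamma)$, it snaps the left 2 × 3 block to the nearest scaled matrix with orthonormal rows via an SVD, which has the same effect of enforcing the projection constraint. The function name is hypothetical and the translation column is simply kept.

```python
import numpy as np

def recompose_weak_perspective(M):
    """Project a 2x4 matrix onto the set s * [R_2x3 | t/s].

    The left 2x3 block is replaced by s * R2, where R2 has orthonormal
    rows (the first two rows of a rotation) and s is a single scale.
    The translation column is left unchanged (an assumption of this sketch).
    """
    A, t = M[:, :3], M[:, 3:]
    U_, sing, Vt = np.linalg.svd(A, full_matrices=False)  # A = U_ diag(sing) Vt
    R2 = U_ @ Vt                 # closest 2x3 matrix with orthonormal rows
    s = sing.mean()              # one scale shared by both rows
    return np.hstack([s * R2, t]), s, R2

# Example: a valid matrix perturbed by an additive update.
yaw = np.deg2rad(40.0)
R = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
              [0.0, 1.0, 0.0],
              [-np.sin(yaw), 0.0, np.cos(yaw)]])
M_valid = np.hstack([2.0 * R[:2], [[10.0], [20.0]]])
M_noisy = M_valid + 0.05 * np.random.default_rng(0).normal(size=(2, 4))
M_fixed, s, R2 = recompose_weak_perspective(M_noisy)
```

If yaw, pitch, and roll are needed explicitly, they can be read off the full rotation obtained by appending the cross product of the two rows of `R2` as a third row.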

Similarly, the second learning task of the $k$th regressor is

$$\Theta_2^k = \arg\min_{\Theta_2^k} \sum_{i=1}^{N_d} \left\| \Delta \mathbf{p}_i^k - R_2^k(I_i, U_i, \mathbf{v}_i^k; \Theta_2^k) \right\|^2, \tag{8}$$

where $U_i$ is computed via Eq. 7 except that $M_i^{k-1}$ is replaced with $M_i^k$, and $\Delta \mathbf{p}_i^k = \mathbf{p}_i - \mathbf{p}_i^{k-1}$. We also obtain $\Delta \hat{\mathbf{p}}_i = R_2^k(\cdot; \Theta_2^k)$ for all training images and update $\mathbf{p}_i^k = \mathbf{p}_i^{k-1} + \Delta \hat{\mathbf{p}}_i$. This iterative learning procedure continues for $K$ cascade layers.

Learning $R^k(\cdot)$. Our cascaded coupled-regressor scheme does not depend on the particular feature representation or the type of regressors. Therefore, we may define them based on prior work or any future development in features and regressors. Specifically, in this work we adopt the HOG-based linear regressor [34] and the fern regressor [4].

For the linear regressor, we denote by $f(I, U)$ a function that extracts HOG features in a small rectangular region around each of the $N$ landmarks, and returns a $32N$-dim feature vector. Thus, we define the regressor function as

$$R(\cdot) = \Theta^\top \cdot \mathrm{Diag}^*(\mathbf{v}_i) \, f(I_i, U_i), \tag{9}$$

where $\mathrm{Diag}^*(\mathbf{v})$ is a function that duplicates each element of $\mathbf{v}$ 32 times and converts the result into a diagonal matrix of size $32N$. Note that we also add a regularization term, $\lambda \|\Theta\|^2$, to Eq. 6 or Eq. 8 for a more robust least-squares solution. By plugging Eq. 9 into Eq. 6 or Eq. 8, the regressor parameter $\Theta$ (e.g., an $N_S \times 32N$ matrix for $R_2^k$) can be easily estimated in closed form.
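
The closed-form solution referred to above is a ridge regression on visibility-masked features. A minimal sketch, assuming the features are already extracted and ordered landmark by landmark (32 values each); the function names and array shapes are illustrative choices, and the HOG extraction $f(I, U)$ itself is not shown.

```python
import numpy as np

FEAT_PER_LANDMARK = 32  # feature length per landmark, as in the text

def mask_features(F, V):
    """Apply Diag*(v) to each row: zero the features of invisible landmarks.

    F : (N_d, 32N) features, ordered landmark by landmark
    V : (N_d, N) hard visibilities (0 or 1)
    """
    return F * np.repeat(V, FEAT_PER_LANDMARK, axis=1)

def fit_linear_regressor(F, V, Y, lam):
    """Ridge solution Theta = (X^T X + lam I)^-1 X^T Y with X = masked F.

    Y : (N_d, D) target updates (flattened Delta M or Delta p).
    Returns Theta with shape (32N, D), so a prediction is Theta.T @ x.
    """
    X = mask_features(F, V)
    gram = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def predict(Theta, F, V):
    """Evaluate Eq. 9 for a batch of samples."""
    return mask_features(F, V) @ Theta
```

A consequence of the masking is that a landmark never visible in the training set receives exactly zero weights, since its feature columns are all zero and only the regularizer acts on them.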

For the fern regressor, we follow the training procedure of [4]. That is, we divide the face region into a $3 \times 3$ grid. At each cascade layer, we choose the 3 out of 9 zones with the least occlusion, computed based on $\{\mathbf{v}_i\}$. For each selected zone, a depth-5 random fern regressor is learned from the shape-indexed features selected by the correlation-based method [5] from that zone only. Finally, the learned $R(\cdot)$ is a weighted mean vote of the 3 fern regressors, where the weight is inversely proportional to the average amount of occlusion in that zone.

3.3. 3D Surface-Enabled Visibility 

Up to now, the only thing that has not been explained in the training procedure is the visibility of the projected 2D landmarks, $\mathbf{v}_i$. During testing we have to estimate $\mathbf{v}$ at each cascade layer for each testing image, since no visibility information is given. As a result, during the training procedure, we also have to estimate $\mathbf{v}$ per cascade layer for each training image, rather than using the manually labeled ground truth visibility, which is useful for estimating the ground truth $P$ as shown in Eq. 5.

Depending on the camera projection matrix $M$, the visibility of each projected 2D landmark can change dynamically along different layers of the cascade (see the top-right part of Fig. 2). In order to estimate $\mathbf{v}$, we decide to use the 3D face surface information. We start by assuming that every individual has a similar 3D surface normal vector at each of its 3D landmarks. Then, by rotating the surface normal according to the rotation angle indicated by the projection matrix, we can determine whether its z-axis coordinate points toward the camera (i.e., visible) or away from the camera (i.e., invisible). In other words, the sign of the z-axis coordinate indicates visibility.

By taking a set of 3D scans with manually labeled 3D landmarks, we can compute the landmarks' average 3D surface normals, denoted as a $3 \times N$ matrix $\mathbf{N}$. Then we use the following equation to compute the visibility vector,

$$\mathbf{v} = \mathbf{N}^\top \cdot \left( \frac{\mathbf{m}_1}{\|\mathbf{m}_1\|} \times \frac{\mathbf{m}_2}{\|\mathbf{m}_2\|} \right), \tag{10}$$

where $\mathbf{m}_1$ and $\mathbf{m}_2$ are the left-most three elements of the first and second rows of $M$ respectively, and $\|\cdot\|$ denotes the $L_2$ norm. For fern regressors, $\mathbf{v}$ is a soft visibility within ±1. For linear regressors, we further compute $\mathbf{v} = \frac{1}{2}(1 + \mathrm{sign}(\mathbf{v}))$, which results in a hard visibility of either 1 or 0.
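
Eq. 10 is short enough to state directly in code. The sketch below assumes unit-length normals stored as a 3 × N array and a camera looking along the z axis; the function name and the toy normals are hypothetical.

```python
import numpy as np

def landmark_visibility(M, normals, hard=True):
    """Visibility of N landmarks from Eq. 10.

    M       : (2, 4) weak-perspective projection matrix
    normals : (3, N) average unit surface normals at the landmarks
    hard    : if True, return 0.5 * (1 + sign(v)); otherwise the soft
              value in [-1, 1]
    """
    m1 = M[0, :3] / np.linalg.norm(M[0, :3])
    m2 = M[1, :3] / np.linalg.norm(M[1, :3])
    z_axis = np.cross(m1, m2)          # third row of the rotation
    v = normals.T @ z_axis             # z-coordinate of each rotated normal
    return 0.5 * (1.0 + np.sign(v)) if hard else v

# Toy normals: facing the camera, facing away, and facing mostly sideways.
side = np.array([1.0, 0.0, 0.2]) / np.linalg.norm([1.0, 0.0, 0.2])
normals = np.stack([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0], side], axis=1)

frontal = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
yaw = np.deg2rad(60.0)
turned = np.array([[np.cos(yaw), 0.0, np.sin(yaw), 0.0],
                   [0.0, 1.0, 0.0, 0.0]])

v_frontal = landmark_visibility(frontal, normals)
v_turned = landmark_visibility(turned, normals)
```

A normal exactly perpendicular to the viewing direction gives `sign(0) = 0` and hence a hard value of 0.5; how such ties are resolved is not specified in the text.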

In summary, we present the detailed training procedure in Algorithm 1.

Algorithm 1: The training procedure of PIFA.

Data: 3D model $\{\{S_i\}_{i=0}^{N_S}, \mathbf{N}\}$, training samples and labels $\{I_i, U_i, \mathbf{b}_i, \mathbf{v}_i\}_{i=1}^{N_d}$.
Result: Cascaded coupled-regressor parameters $\{\Theta_1^k, \Theta_2^k\}_{k=1}^{K}$.

1  foreach $i = 1, \cdots, N_d$ do
2      Estimate $M_i$ and $\mathbf{p}_i$ via Eq. 5;
3      $M_i^0 = g(\bar{M}, \mathbf{b}_i)$, $\mathbf{p}_i^0 = \mathbf{0}$ and $\mathbf{v}_i^0 = \mathbf{1}$;
4  foreach $k = 1, \cdots, K$ do
5      Compute $U_i$ via Eq. 7 for each image;
6      Estimate $\Theta_1^k$ via Eq. 6;
7      Update $M_i^k$ and $U_i$ for each image;
8      Compute $\mathbf{v}_i^k$ via Eq. 10 for each image;
9      Estimate $\Theta_2^k$ via Eq. 8;
10     Update $\mathbf{p}_i^k$ for each image;
11 return $\{R_1^k(\cdot; \Theta_1^k), R_2^k(\cdot; \Theta_2^k)\}_{k=1}^{K}$.


Model Fitting. Given a testing image $I$ and its initial parameters $M^0$ and $\mathbf{p}^0$, we can apply the learned cascaded coupled-regressor for face alignment. Basically, we iteratively use $R_1^k(\cdot; \Theta_1^k)$ to compute $\Delta \hat{M}$, update $M^k$, use $R_2^k(\cdot; \Theta_2^k)$ to compute $\Delta \hat{\mathbf{p}}$, and update $\mathbf{p}^k$. Finally, the estimated 3D landmarks are $\hat{S} = S_0 + \sum_i p_i^K S_i$, and the estimated 2D landmarks are $\hat{U} = M^K \hat{S}$. Note that $\hat{S}$ carries the individual 3D shape information of the subject, but is not necessarily in the same pose as the 2D testing image.
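
The test-time loop can be sketched as below. This is an illustration under stated assumptions, not the authors' code: each cascade layer is represented by a pair of hypothetical callables `(R1, R2)` that take `(image, U, v)` and return the updates for $M$ (2 × 4) and $\mathbf{p}$ (length $N_S$); the step that re-composes a valid projection matrix after each update is omitted, and soft visibilities are used throughout.

```python
import numpy as np

def shape_and_projection(M, p, S0, bases):
    """Return (U, S): S from the 3DMM (Eq. 4) and U = M S (Eq. 3)."""
    S = S0 + np.tensordot(p, bases, axes=1)
    return M @ S, S

def soft_visibility(M, normals):
    """Soft visibility of Eq. 10."""
    m1 = M[0, :3] / np.linalg.norm(M[0, :3])
    m2 = M[1, :3] / np.linalg.norm(M[1, :3])
    return normals.T @ np.cross(m1, m2)

def fit_image(image, M0, p0, S0, bases, normals, cascade):
    """Run K cascade layers; `cascade` is a list of (R1, R2) callables."""
    M, p = M0.copy(), p0.copy()
    v = np.ones(S0.shape[1])                      # v^0 = 1
    for R1, R2 in cascade:
        U, _ = shape_and_projection(M, p, S0, bases)
        M = M + R1(image, U, v)                   # update projection matrix
        v = soft_visibility(M, normals)           # re-estimate visibility
        U, _ = shape_and_projection(M, p, S0, bases)
        p = p + R2(image, U, v)                   # update shape parameter
    U_hat, S_hat = shape_and_projection(M, p, S0, bases)
    return U_hat, S_hat, v, M, p
```

Keeping the regressors as plain callables mirrors the remark that the scheme does not depend on a particular feature representation or regressor type.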

4. Experimental Results 

Datasets. The goal of this work is to advance the capability of face alignment on in-the-wild faces with all possible view angles, which is the type of images we desire when selecting experimental datasets. However, very few publicly available datasets satisfy this characteristic, or have been extensively evaluated in prior work (see Tab. 1). Nevertheless, we identify three datasets for our experiments.

The AFLW dataset [15] contains ~25k in-the-wild face images, each annotated with the visible landmarks (up to 21 landmarks) and a bounding box. Based on our estimated $M$ for each image, we select a subset of 5,300 images where the numbers of images whose absolute yaw angles are within [0°, 30°], [30°, 60°], and [60°, 90°] are roughly 1/3 each. To have a more balanced distribution of left vs. right view faces, we take the odd-indexed images among the 5,300 (i.e., 1st, 3rd, ...), flip them horizontally, and use them to replace the original images. Finally, a random partition leads to 3,901 and 1,299 images for training and testing respectively. Note that from Tab. 1, it is clear that among all methods that test on all poses, we have the largest number of testing images.

The AFW dataset [44] contains 205 images and in total 468 faces with different poses within ±90°. Each image is labeled with visible landmarks (up to 6) and a face bounding box. We only use AFW for testing.

Since we are also estimating 3D landmarks, it is important to test on a dataset with ground truth, rather than estimated, 3D landmark locations. We find the BP4D-S database [41] to be the best for this purpose; it contains pairs of 2D images and 3D scans of spontaneous facial expressions from 41 subjects. Each pair has semi-automatically generated 83 2D and 83 3D landmarks, and the pose. We apply a random perturbation to the 2D landmarks (to mimic imprecise face detection) and generate their enclosing bounding box. With the goal of selecting as many non-frontal view faces as possible, we select a subset where the numbers of faces whose yaw angles are within [0°, 10°], [10°, 20°], and [20°, 30°] are 100, 500, and 500 respectively. We randomly select half of the 1,100 images for training and the rest for testing, with disjoint subjects.

Experiment setup. Our PIFA approach needs a 3D model of $\{S_i\}_{i=0}^{N_S}$ and $\mathbf{N}$. Using the BU-4DFE database [36], which contains 606 3D facial expression sequences from 101 subjects, we evenly sample 72 scans from each sequence and gather a total of $72 \times 606$ scans. Based on the method in Sec. 3.1, the resultant model has $N_S = 30$ for AFLW and AFW, and $N_S = 200$ for BP4D-S.

During training and testing, for each image with a bounding box, we place the mean 2D landmarks (learned from a specific training set) on the image such that the landmarks on the boundary overlap with the four edges of the box. For training with linear regressors, we set $K = 10$ and $\lambda = 120$, while $K = 150$ for fern regressors.

Evaluation metric. Given the ground truth 2D landmarks $U_i$, their visibility $\mathbf{v}_i$, and the estimated landmarks $\hat{U}_i$ of $N_t$ testing images, we have two ways of computing the landmark estimation errors: 1) Mean Average Pixel Error (MAPE) [37], which is the average of the estimation errors for visible landmarks, i.e.,

$$\mathrm{MAPE} = \frac{1}{\sum_{i}^{N_t} |\mathbf{v}_i|_1} \sum_{i,j}^{N_t, N} \mathbf{v}_i(j) \left\| \hat{U}_i(:, j) - U_i(:, j) \right\|, \tag{11}$$

where $|\mathbf{v}_i|_1$ is the number of visible landmarks of image $I_i$, and $U_i(:, j)$ is the $j$th column of $U_i$. 2) Normalized Mean Error (NME), which is the average of the normalized estimation error of visible landmarks, i.e.,

$$\mathrm{NME} = \frac{1}{N_t} \sum_{i}^{N_t} \left( \frac{1}{d_i |\mathbf{v}_i|_1} \sum_{j}^{N} \mathbf{v}_i(j) \left\| \hat{U}_i(:, j) - U_i(:, j) \right\| \right), \tag{12}$$

where $d_i$ is the square root of the face bounding box size, as used by [37]. Note that $d_i$ is normally the distance between the two eye centers in most prior face alignment work dealing with near-frontal face images.
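
Both metrics reduce to a few array operations. The sketch below assumes landmarks are stacked as `(N_t, 2, N)` arrays and visibilities as an `(N_t, N)` binary array; these layouts and the function names are choices made for the example.

```python
import numpy as np

def landmark_errors(U_hat, U_gt):
    """Euclidean error per landmark: (N_t, 2, N) -> (N_t, N)."""
    return np.linalg.norm(U_hat - U_gt, axis=1)

def mape(U_hat, U_gt, vis):
    """Eq. 11: mean pixel error over all visible landmarks of all images."""
    err = landmark_errors(U_hat, U_gt)
    return float((vis * err).sum() / vis.sum())

def nme(U_hat, U_gt, vis, d):
    """Eq. 12: per-image mean visible error divided by d_i, then averaged.

    d : (N_t,) normalizers, e.g. the square root of the bounding box size.
    """
    err = landmark_errors(U_hat, U_gt)
    per_image = (vis * err).sum(axis=1) / (d * vis.sum(axis=1))
    return float(per_image.mean())

# Two images with two landmarks each; the last landmark is invisible.
U_gt = np.zeros((2, 2, 2))
U_hat = np.array([[[3.0, 3.0], [4.0, 4.0]],     # errors 5 and 5
                  [[0.0, 6.0], [1.0, 8.0]]])    # errors 1 and 10
vis = np.array([[1.0, 1.0], [1.0, 0.0]])
d = np.array([10.0, 2.0])
print(mape(U_hat, U_gt, vis), nme(U_hat, U_gt, vis, d))
```

The two metrics weight images differently: MAPE pools all visible landmarks, so images with more visible landmarks count more, while NME averages per-image scores.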

Table 2: The NME (%) of three methods on AFLW.

| $N_t$ | PIFA | CDM | RCPR |
|---|---|---|---|
| 1,299 | 6.52 | - | 7.15 |
| 783 | 6.08 | 8.65 | - |

Given the ground truth 3D landmarks $S_i$ and the estimated landmarks $\hat{S}_i$, we first estimate the global rotation, translation and scale transformation so that the transformed $S_i$, denoted as $S'_i$, has the minimum distance to $\hat{S}_i$. We then compute the MAPE via Eq. 11, replacing $U_i$ and $\hat{U}_i$ with $S'_i$ and $\hat{S}_i$, and setting $\mathbf{v}_i = \mathbf{1}$. Thus the MAPE only measures the error due to non-rigid shape estimation, rather than the pose estimation.
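
The text does not say how the global rotation, translation and scale are estimated. One standard choice is the closed-form least-squares similarity alignment (often attributed to Umeyama), sketched here as an assumption rather than as the method actually used; shapes are 3 × N arrays and the function names are hypothetical.

```python
import numpy as np

def similarity_align(S_src, S_dst):
    """Return s * R @ S_src + t minimizing the squared distance to S_dst.

    S_src, S_dst : (3, N) corresponding 3D landmarks.
    """
    n = S_src.shape[1]
    mu_s = S_src.mean(axis=1, keepdims=True)
    mu_d = S_dst.mean(axis=1, keepdims=True)
    Xs, Xd = S_src - mu_s, S_dst - mu_d
    cov = Xd @ Xs.T / n
    U_, D, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation.
    flip = 1.0 if np.linalg.det(U_ @ Vt) >= 0 else -1.0
    E = np.diag([1.0, 1.0, flip])
    R = U_ @ E @ Vt
    s = (D * np.diag(E)).sum() / ((Xs ** 2).sum() / n)
    t = mu_d - s * R @ mu_s
    return s * R @ S_src + t

def mape_3d(S_gt, S_hat):
    """3D MAPE for one shape: align the ground truth to the estimate,
    then average the per-landmark distances (all landmarks visible)."""
    S_aligned = similarity_align(S_gt, S_hat)
    return float(np.linalg.norm(S_aligned - S_hat, axis=0).mean())
```

After this alignment, any remaining error reflects non-rigid shape differences only, which matches the intent stated above.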

Choice of baseline methods. Given the explosion of face alignment work in recent years, it is important to choose appropriate baseline methods so as to make sure the proposed method advances the state of the art. In this work, we select three recent works as baseline methods: 1) CDM [37] is a CLM-type method and the first one claimed to perform pose-free face alignment, which has exactly the same objective as ours. On AFW it also outperforms the other well-known TSPM method [44] that can handle all-pose faces. 2) TCDCN [42] is a powerful deep learning-based method published in the most recent ECCV. Although it only estimates 5 landmarks for up to ~60° yaw, it represents the most recent development in face alignment. 3) RCPR [4] is a regression-type method that represents occlusion-invariant face alignment. Although it is an earlier work than CoR [38], we choose it due to its superior performance on the large COFW dataset (see Tab. 1 of [38]). It can be seen that these three baselines not only are most relevant to our focus on pose-invariant face alignment, but also represent well the major categories of existing face alignment algorithms based on [30].

Comparison on AFLW. Since the source code of RCPR is publicly available, we are able to perform the training and testing of RCPR on our specific AFLW partition. We use the available executable of CDM to compute its performance on our test set. We strive to provide the same setup to the baselines as ours, such as the initial bounding box, regressor learning, etc. For our PIFA method, we use the fern regressor. Because CDM integrates face detection and pose-free face alignment, no bounding box was given to CDM, and it successfully detects and aligns 783 out of the 1,299 testing images. Therefore, to compare with CDM, we evaluate the NME on the same 783 testing images. As shown in Tab. 2, our PIFA shows superior performance to both baselines. Although TCDCN also reports performance on a subset of 3,000 AFLW images within ±60° yaw, it is evaluated with 5 landmarks, based on the NME when $d_i$ is the inter-eye distance. Hence, without the source code of TCDCN, it is difficult to have a fair comparison on our subset of AFLW images (e.g., we cannot define $d_i$ as the inter-eye distance due to profile view images). On the 1,299 testing images, we also test our method with linear regressors, and achieve an NME of 7.50, which shows the strength of fern regressors.

Table 3: The comparison of four methods on AFW.

Figure 3: The NME of five pose groups for two methods.

Comparison on AFW. Unlike our specific subset of AFLW, the AFW dataset has been evaluated by all three baselines, but different metrics are used. Therefore, the results of the baselines in Tab. 3 are taken from the published papers, instead of from executing the testing code. One note is that from the TCDCN paper [42], it appears that all 5 landmarks are visible on all displayed images and no visibility estimation is shown, which might suggest that TCDCN was evaluated on a subset of AFW with up to ±60° yaw. Hence, we select the 313 out of 468 faces within this pose range and test our algorithm. Since our subset likely differs from that of [42], please take this into consideration when comparing with TCDCN. Overall, our PIFA method still performs favorably among the four methods. This is especially encouraging given the fact that TCDCN utilizes a substantially larger training set of 10,000 images, more than twice the size of our training set. Note that in addition to Tab. 2 and 3, our PIFA also has other benefits as shown in Tab. 1. For example, we have 3D and visibility estimation, while RCPR has no 3D estimation and TCDCN does not have visibility estimation.

Estimation error across poses. Just as pose-invariant face recognition studies the recognition rate across poses [6], we also study the performance of face alignment across poses. As shown in Fig. 3, based on the estimated projection matrix $M$ and its yaw angles, we partition all testing images of AFLW into five bins, each around a specific yaw angle. Then we compute the NME of testing images within each bin, for our method and RCPR. We can observe that the profile view images have in general larger NME than near-frontal images, which shows the challenge of pose-invariant face alignment. Further, the improvement of PIFA over RCPR is consistent across most of the poses.

Estimation error across landmarks. We are also interested in the estimation error across various landmarks, under a wide range of poses. Hence, for the AFLW test set, we compute the NME of each landmark for our method. As shown in Fig. 4, the two eye regions have the least amount of error. The two landmarks under the ears have the most error, which is consistent with our intuition. These observations also align well with prior face alignment studies on near-frontal images.

Figure 4: The NME of each landmark for PIFA.

Figure 5: 2D and 3D alignment results of the BP4D-S dataset.

Table 4: Efficiency of four methods in FPS.

| PIFA | CDM | RCPR | TCDCN |
|---|---|---|---|
| 3.0 | 0.2 | 3.0 | 58.8 |

3D landmark estimation. By performing the training and testing on the BP4D-S dataset, we can evaluate the MAPE of 3D landmark estimation, with exemplar results shown in Fig. 5. Since there is limited 3D alignment work, much of which does not perform quantitative evaluation, such as [12], we are not able to find another method as the baseline. Instead, we use the 3D mean shape, $S_0$, as a baseline and compute its MAPE with respect to the ground truth 3D landmarks $S_i$ (after global transformation). We find that the MAPE of the $S_0$ baseline is 5.02, while our method has 4.75. Although our method offers a better estimation than the mean shape, this shows that 3D face alignment is still a very challenging problem. We hope our efforts to quantitatively measure the 3D estimation error, which is more difficult than its 2D counterpart, will motivate more research activities to address this challenge.

Computational efficiency. Based on the efficiency reported in the publications of the baseline methods, we compare the computational efficiency of the four methods in Tab. 4. Only TCDCN is measured based on a C implementation, while the other three are all based on Matlab implementations. It can be observed that TCDCN is the most efficient one. Our unoptimized implementation has a reasonable speed of 3 FPS, and we believe this efficiency can be substantially improved with an optimized C implementation.

Figure 6: Testing results of AFLW (top) and AFW (bottom). As shown in the top row, we initialize face alignment by placing a 2D mean shape in the given bounding box of each image. Note the disparity between the initial landmarks and the final estimated ones, as well as the diversity in pose, illumination and resolution among the images. Green/red points indicate visible/invisible estimated landmarks.


Qualitative results. We now show the qualitative face alignment results for images in three datasets. As shown in Fig. 6, despite the large pose range of ±90° yaw, our algorithm does a good job of aligning the landmarks, and correctly predicts the landmark visibilities. These results are especially notable considering that the same mean shape (2D landmarks) is used as the initialization for all testing images, which therefore undergo very large deformations with respect to their final estimated landmarks.


5. Conclusions 

Motivated by the fast progress of face alignment technologies and the need to align faces at all poses, this paper draws attention to a relatively unexplored problem: face alignment robust to pose variation. To this end, we propose a novel approach that tightly integrates the powerful cascaded regressor scheme and the 3D deformable model. The 3DMM not only serves as a compact constraint, but also offers an automatic and convenient way to estimate the visibilities of 2D landmarks, a key for successful pose-invariant face alignment. As a result, for a 2D image, our approach estimates the locations of 2D and 3D landmarks, as well as their 2D visibilities. We conduct an extensive experiment on a large collection of all-pose face images and compare with three state-of-the-art methods. While superior 2D landmark estimation has been shown, the performance on 3D landmark estimation indicates the future direction for improving this line of work.